Abstract

POLH and POLI are paralogs encoding low-fidelity, class Y DNA polymerases involved in replication of damaged DNA in the human
disease xeroderma pigmentosum variant. Analysis of genomic regions for human and mouse homologs, employing the analytic tool Genome
Cryptographer, detected low-repetitive or unique regions at exons and other potential control regions, especially within intron I of human
POLH. The human and mouse homologs are structurally similar, but the paralogs have undergone evolutionary divergence. The information
content of splice sites for human POLH, the probability that a base would contribute to splicing, was low only for the acceptor site of exon
II, which is preceded by a region of high information content that could contain sequences controlling splicing. This analysis explains
previous observations of tissue-specific exon skipping during mRNA processing, resulting in the loss of the translation start site in exon II, in
human tissues.

© 2003 Elsevier Inc. All rights reserved. 

Keywords: Pol H; Pol I; Xeroderma pigmentosum variant; Alternative splice; Donor; Acceptor; DNA damage 



Several new classes of DNA polymerases have recently been identified in
human cells, one being related to the bacterial class of mutagenic
polymerases involved in the SOS repair system [1]. These class Y
polymerases have reduced fidelity and are able to replicate a variety of
damaged DNA templates with relaxed specificity [2-5]. The catalytic regions
of the genes specifying these polymerases are in many cases homologs of the
catalytic regions of the bacterial UmuC/D polymerase [1]. The catalytic
regions have larger active sites than the replicative class B polymerases
(Pol α, δ, ε) and can accommodate damaged bases or covalent adducts on
template strands [6,7].

Recent work on these polymerases was stimulated by the discovery that the
gene mutated in the xeroderma pigmentosum variant (XP-V), a human disorder
exhibiting high levels of UV-induced carcinogenesis, encodes a DNA
polymerase [2,4].



* Corresponding author. Fax: +1-415-476-8218. 
E-mail address: jcleaver@cc.ucsf.edu (J.E. Cleaver). 



The XP-V gene, hRad30A, Pol η or POLH, on chromosome 6p21, has a paralog in
the human genome, hRad30B or POLI, on chromosome 18 [8-10], but only a
single copy is found in yeast, yRad30 [11]. The two genes represent ancient
duplications that produce polymerases with overlapping specificities for
replicating damaged DNA that can partially compensate for one another, as is
commonly found for many genes in eukaryotic cells [12]. We previously
reported that POLH undergoes significant amounts of alternative splicing; in
the testis and fetal liver, exon II, which encodes the ATG start site, is
frequently spliced out [13]. We therefore conducted a detailed analysis of
the genomic regions containing the POLH and POLI genes in human and mouse
and determined the splicing efficiency in the regions of each intron/exon
junction in human POLH. This approach can be used to understand the
consequences of gene duplication to produce paralogs and the causes of exon
skipping.

Fig. 1. Genome Cryptographer analysis of mouse (top) and human (bottom) POLH. Histograms of colored bars are shown at 1-kb intervals across the genomic
regions containing POLH. The legend for the color of each bar is shown at the top. The gene is indicated above the histogram, annotated to show exons as
small blocks on a line denoting the span of the gene. CpG regions are shown as histograms below the x axis line. Orange circles denote ESTs.

Fig. 2. Genome Cryptographer analysis of mouse (top) and human (bottom) POLI. Histograms of colored bars are shown at 1-kb intervals across the genomic
region containing POLI. The legend for the color of each bar is shown at the top. The gene is indicated above the histogram, annotated to show exons as
small blocks on a line denoting the span of the gene. CpG regions are shown as histograms below the x axis line. Orange circles denote ESTs.

Genome scanning techniques such as comparative
genomic hybridization (CGH), restriction landmark genome
scanning, and analysis of loss of heterozygosity have mapped numerous
regions of recurrent genome copy number abnormalities in human solid tumors
[14]. Although these mapping techniques have more often been used in
analysis of genomic changes associated with malignancy, they can also be
used for detailed analysis of individual genes in their native state. We
used the suite of software tools collectively called Genome Cryptographer
(GC) to facilitate integrative analysis. GC collects genome sequence
information from multiple databases and visually displays it in analysis
intervals (AIs) of constant width along the genome. Displayed information
includes CpG density, sequence-tagged sites, expressed sequence tag (EST)
clusters, locations and densities of repeated sequences (e.g., Alus, SINEs,
LINEs), duplicons, similarities with syntenic murine sequences, known genes,
and genome copy number determined using array CGH. This analysis produces a
detailed map of the genomic landscape of these regions. We previously
applied GC to the analysis of 1.2 Mb of 20q13.2 because it is amplified in a
wide range of tumor types [14-16]. We have now applied it to the analysis of
the chromosomal regions carrying POLH and POLI and also applied information
theory to the analysis of the fine structure of intron/exon junctions. This
has enabled us to map splice sites and their strengths and coordinate this
information with the GC analysis to explain some of the details of POLH gene
expression.



Results and discussion of GC sequence analyses 

The results obtained from GC analysis for human and mouse POLH and POLI at
1-kb intervals across their respective chromosomal regions revealed the
distribution of repetitive elements in intronic regions and regions of low
complexity corresponding to many of the exons (Figs. 1 and 2). Coding exons
appear as valleys in these distributions; repetitive regions are
represented by colored bars. POLH appeared to contain more SINE elements
(red bars) (Fig. 1) and POLI more satellite regions (green bars) (Fig. 2).
CpG islands were evident at the 5' end of both genes and additionally in the
intron I region of POLH. Although POLH and POLI have a common origin before
the evolution of the mammalian clade, based on their primary coding
sequence, the intronic regions have undergone considerable change in their
distribution of repetitive elements. Although each polymerase gene retains
considerable similarity across species, the two polymerases have diverged
from one another, as expected for paralogs resulting from gene duplication
and evolutionary drift [12].

In our initial analysis a valley in intron IV (position 26000, Fig. 1,
bottom) of the human POLH appeared to correspond to the first exon of a
gene, exportin-5, that was transcribed in the opposite direction [13,17].
Exportin-5 is involved in nuclear export of double-stranded RNA binding
proteins [17]. Subsequently the assembly of this region was changed. The
exportin-5 gene now appears not to overlap with POLH. In the human genome a
space of about 2 kb lies between the 5' ends of the exportin-5 and POLH
genes; in the mouse there is negligible space between the two genes. The
valley in intron IV suggests that a residual or pseudo-exon may remain in
this region. The promoters of POLH and exportin-5 are still likely to
overlap in both human and mouse genomes. Deletion of the 5' end of PolH in
initial attempts to make a PolH knockout mouse resulted in embryonic
lethality, possibly because of simultaneous deletion of regions of both
exportin-5 and PolH. Targeting the 3' end of the gene should therefore be
more successful in making a viable knockout mouse, because human patients
who have chain-terminating mutations that result in no protein being
synthesized are viable [18].

The architecture of the POLH genomic region is typical of many human genes,
especially those associated with human disease, in having an untranslated
first exon, a long first intron, and juxtaposed genes transcribed in
opposite directions [19]. Of particular interest is the intron I region of
human POLH, because we have previously observed tissue-specific splicing
that eliminated exon II [13]. This region contains approximately 1 to 2 kb
of very low complexity sequence and a high frequency of CpG sequences
(Fig. 1, bottom). This region is therefore likely to be important in
regulating the efficiency of splicing and the subsequent expression of an
alternatively spliced variant of POLH lacking exon II. This exon contains
the translation start site, and therefore alternative splicing may be a
mechanism of posttranscriptional regulation via inactivation of the message.

The murine PolH genomic region resembles human POLH in its pattern of
repetitive elements (Fig. 1, top). There are regions of low complexity at
the 5' and 3' ends of the gene and aligning with some but not all of the
coding exons. There is, however, no low-complexity region in intron I in the
mouse gene. We therefore predict that elimination of exon II by alternative
splicing, which has been observed in specific human tissues such as lung,
testis, and embryonic liver, will be less likely to occur in mouse tissues
[13].



Results and discussion of splice site analyses 

Splicing efficiency is determined by the sequence context around each
individual splice site. To analyze splicing efficiency in human POLH, and to
understand the regulation of the gene in the region of alternative splicing,
we have calculated the "individual information variable" (R_i; Methods,
Eq. (2)). This value represents the probability that a site acts as an
acceptor or donor in splicing. In theory, R_i values
Table 1

Splice site R_i values and locations and exon locations through the POLH gene

Mark  Type         Value       Location
      Donor        5.7 bits    13965
>     Exon No. 1               13747-13979
>     Donor        7.4 bits    13980
      Acceptor     3.8 bits    14018
      Acceptor     2.9 bits    14057
      Donor        2.5 bits    14060
      Acceptor     3.5 bits    19817
      Donor        3.2 bits    19818
<     Acceptor     3.4 bits    19854
<     Exon No. 2               19855-19995
      Acceptor     4.4 bits    19919
      Donor        2.6 bits    19989
      Donor        6.5 bits    19992
>     Exon No. 2               19855-19995
>     Donor        5.7 bits    19996
      Donor        4.7 bits    20031
      Acceptor     4.4 bits    20540
<     Acceptor     11.3 bits   20545
<     Exon No. 3               20546-20680
      Donor        5.6 bits    20556
      Donor        5.8 bits    20593
      Donor        2.5 bits    20646
      Donor        4.4 bits    20650
>     Exon No. 3               20546-20680
>     Donor        8.1 bits    20681
      Donor        2.7 bits    20750
      Donor        5.9 bits    20770
      Donor        6.1 bits    24739
<     Acceptor     5.7 bits    24810
<     Exon No. 4               24811-25028
      Donor        3.3 bits    24811
      Donor        3.4 bits    24833
>     Exon No. 4               24811-25028
>     Donor        5.4 bits    25029
      Acceptor     3.3 bits    25069
      Donor        8.3 bits    35213
<     Acceptor     5.5 bits    35234
<     Exon No. 5               35235-35404
      Acceptor     8.8 bits    35281
      Acceptor     7.3 bits    35303
      Acceptor     7.4 bits    35311
>     Exon No. 5               35235-35404
>     Donor        7.8 bits    35405
      Donor        3.7 bits    35460
      Donor        2.7 bits    35469
<     Acceptor     9.3 bits    38526
<     Exon No. 6               38527-38630
      Donor        3.8 bits    38589
      Acceptor     2.8 bits    38601
      Acceptor     2.7 bits    38609
      Acceptor     6.3 bits    38677
>     Exon No. 6               38527-38630
>     Donor        9.3 bits    38631
      Acceptor     3.5 bits    41384
<     Acceptor     9.6 bits    41430
<     Exon No. 7               41431-41550
      Acceptor     5.3 bits    41436
      Donor        3.0 bits    41493
      Acceptor     3.0 bits    41521
      Acceptor     4.5 bits    41527
      Acceptor     3.0 bits    41521
      Acceptor     [illegible] [illegible]
      Acceptor     2.9 bits    42107
      Acceptor     3.3 bits    42143
>     Exon No. 7               41431-41550
>     Donor        9.2 bits    41551
      Donor        3.5 bits    41555
<     Acceptor     7.6 bits    42153
<     Exon No. 8               42154-42277
      Donor        4.8 bits    42234
>     Exon No. 8               42154-42277
>     Donor        10.8 bits   42278
      Acceptor     4.4 bits    42317
      Donor        4.9 bits    [illegible]
      Donor        3.1 bits    42324
      Donor        5.7 bits    42328
      Donor        2.9 bits    42338
      Acceptor     3.2 bits    42746
<     Acceptor     4.7 bits    42792
<     Exon No. 9               42793-42858
      Donor        2.4 bits    42793
      Donor        3.9 bits    42801
>     Exon No. 9               42793-42858
>     Donor        6.1 bits    [illegible]
      Acceptor     6.5 bits    42883
      Acceptor     3.8 bits    [illegible]
      Donor        4.4 bits    [illegible]
      Acceptor     4.4 bits    42933
      Acceptor     3.0 bits    48086
      Acceptor     3.5 bits    48090
<     Acceptor     11.4 bits   48092
<     Exon No. 10              48093-48262
      Donor        4.3 bits    48120
>     Exon No. 10              48093-48262
>     Donor        9.5 bits    [illegible]
      Acceptor     3.1 bits    48278
      Acceptor     10.4 bits   48291
      Acceptor     5.4 bits    48296
      Acceptor     4.0 bits    48321
      Acceptor     4.5 bits    [illegible]
      Acceptor     3.8 bits    51196
<     Acceptor     12.1 bits   51198
<     Exon No. 11              51199-53181
      Acceptor     3.4 bits    51213
      Acceptor     3.8 bits    51236
      Acceptor     2.9 bits    51260
      Acceptor     4.3 bits    51267
      Acceptor     12.5 bits   51269
      Acceptor     2.9 bits    51279
      Exon No. 11              51199-53181

Actual exon splice site positions are indicated by boldface; individual donor and acceptor sites marked by > and <, respectively. 



of at least zero are required for an acceptor or donor site to exist.
Empirically, it has been found that an R_i value of at least 2.4 bits is
almost always required for a splice site to be functional [20]. Strong
acceptor sites are in the range of 9 to 10 bits and up. Strong donor sites
are in the range of 7 or 8 bits and up.
[Figure 3 graphic; x axis: nucleotide position in the Pol H gene]

Fig. 3. Top: Splice site R_i values shown for empirically determined splice
sites in the first four exons of human POLH. Solid inverted triangles, donor
sites; open inverted triangles, acceptor sites; arrows show direction of
correct (solid arrows) and alternative (dashed arrow) splicing between
donor and acceptor sites between exon I and exon II. Bottom: First four
exons of POLH gene with location of exons shown by solid rectangles and
splicing shown by connecting lines between donor and acceptor sites.
Alternative splicing causes skipping of exon II.



Wild-type POLH

We analyzed the complete genomic region of human POLH for the occurrence of
all possible donor and acceptor sites and calculated their R_i values. Many
values were in the range of 2.5 to 5.5, which is below that usually required
for effective splicing (Table 1). Most of the splice sites that correspond
to empirically known splice sites had R_i values that indicated strong
splicing. These values ranged from 4.5 to 12.1 bits and neighboring sites
generally had lower values (Table 1, Fig. 3). The major exception was the
acceptor site for exon II that had a low value of 3.5 bits and had
neighboring sites with comparable to larger values (Table 1, Fig. 3). This
site is skipped in a number of tissues and to some extent in cell culture,
in preference for the acceptor site of the next exon, III, resulting in the
loss of 141 nt of exon II [13].

The acceptor site of strength 3.5 bits at location 19854 is weak but in the
correct location to introduce exon II beginning at 19855 (Fig. 4). There is
an acceptor with R_i = 3.5 bits at 19817 and one with R_i = 4.4 bits at
19919. These are all relatively weak acceptors, but above the absolute limit
of 0 bits for definition of a site and above the empirical limit of 2.4 bits
for functional sites. There is a donor site with R_i = 5.7 bits at 19996,
corresponding to the end of exon II at 19995. However, there are also donors
with R_i = 6.5 bits at 19992 and 4.7 bits at 20031. These are moderately
strong sites. There is even a fairly weak one at 19989 with R_i = 2.6 bits.
These sites, along with observed exon II, are illustrated in Fig. 4.

A good starting point for relating individual information R_i values to
binding effectiveness is to regard the binding site affinity, and thus the
likelihood of use, of a site with value R_i as proportional to 2^(R_i). With
acceptors this weak, there could be a significant fraction of exon skipping,
as has been observed experimentally [13]. There could be as many as a
half-dozen alternative exons in this situation, with binding site affinities
within a factor of 2 to 4 of each other. This grows to nine sites with
affinities within another factor of 2, if the site at 20031 is used. If the
donor at 19989 is active, up to a dozen alternatives are possible. These
possible alternative exons are shown with the splice sites in Fig. 4.


Effect of mutations in POLH on splicing efficiencies 

Several XP-V cases have now been analyzed and mutations have been
identified in the N-terminal catalytic domain and at the C-terminal end,
which regulates nuclear foci formation and PCNA interaction [2,4,21-23].
Yuasa et al. [21] report that a G → C mutation at location 19854, which is
at the end of intron I, just preceding exon II, results in skipping exon II.
Our analysis finds that this mutation changes the R_i value of this acceptor
from 3.5 to -3.8 bits (Table 2). Thus, by our analysis, this mutation
changes a weak, but adequate, site to one that is almost certainly not
functional. This is consistent with skipping exon II, as reported. Other
mutations in various XP-V patients occur sufficiently far from the nearest
splice sites that they made


Fig. 4. Top: Walker analysis of the wild-type genomic sequence of human POLH in the area of exon II. Genomic sequences are shown horizontally, with
locations given above each in increments of 10 bp. Asterisks indicate locations that are multiples of 5. A brief description of each piece of DNA is given
above the locations. Individual information contributions are shown below the sequence, with positive contributions pointing up and negative contributions
pointing down. The positions of splice sites are boxed. The sites are labeled with type, strength (R_i value), and location. Exon II is shown as a horizontal
dashed line between "[" and ">" symbols. It is initiated by an acceptor site and terminated by a donor site. Bottom: Walker analysis of the wild-type genomic
sequence of human POLH in the area of exon II, with predictions of possible alternative splice exons. Possible alternative exons consistent with the splice
sites found are shown as horizontal dashed lines between "[" and ">" symbols. They are initiated by acceptor sites and terminated by donor sites. For this
set, the acceptors have strengths of at least 3.4 bits, and the donors have strengths of at least 4.7 bits. Here, exon II specified in the data set is predictable
as 2(b2), and eight other putative alternative exons are indicated. These are coded by piece, exon, acceptor beginning, and donor ending. For example,
alternative exon 2(b3) begins after acceptor "b", at location 19855, and ends before donor "3", at location 20030. In this case, "b" and "3" are simply ordering
conventions.
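
The count of nine candidate exons in the caption can be reproduced by pairing every acceptor with every downstream donor that passes the stated thresholds. The following is a minimal sketch, not the DELILA `exon` program: the site positions and R_i values are those quoted in the text and Table 1 for the exon II region, while the function name `candidate_exons` and the data layout are assumptions made for illustration.

```python
# R_i values (bits) by genomic position for the exon II region,
# as quoted in the text and Table 1.
ACCEPTORS = {19817: 3.5, 19854: 3.4, 19919: 4.4}
DONORS = {19989: 2.6, 19992: 6.5, 19996: 5.7, 20031: 4.7}


def candidate_exons(acceptors, donors, min_acceptor=3.4, min_donor=4.7):
    """Pair each acceptor with each donor that passes its threshold.

    Following the caption's example, an exon begins one base after the
    acceptor position and ends one base before the donor position.
    Hypothetical helper, written only to illustrate the enumeration.
    """
    exons = []
    for a_pos, a_bits in sorted(acceptors.items()):
        if a_bits < min_acceptor:
            continue
        for d_pos, d_bits in sorted(donors.items()):
            if d_bits < min_donor:
                continue
            start, end = a_pos + 1, d_pos - 1
            if start <= end:  # the donor must lie downstream of the acceptor
                exons.append((start, end))
    return exons


if __name__ == "__main__":
    for start, end in candidate_exons(ACCEPTORS, DONORS):
        print(start, end)
```

Lowering `min_donor` to 2.6 bits admits the weak donor at 19989 and raises the count from nine to twelve, which matches the "up to a dozen alternatives" mentioned in the text.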



J.E. Cleaver et al / Genomics 82 (2003) 561-570 




Table 2

Effect of mutations on splice site R_i values in adjacent potential splice sites

Cell line  Exon  Mutation       Splice site  R_i change^a    Effect
XP1RO      2     g19854a        19854        A 3.4 to -3.8   Splice site lost
           8     g42185t        42153        A 7.6 to 7.6    Still used
                                42160        None to A 3.2   Not used
XP30RO     2     19964 del 13   19989        D 2.6 to 2.6    Not used
                                19992        D 6.5 to 6.5    Still used
XP7TA      5     35276 del 2    35281        A 8.8 to 7.9    Not used
                                35303        A 7.3 to 7.3    Not used
XP2SA      8     g42185t        42207        A 0.5 to 1.4    Not used

a A represents an acceptor site, D a donor site.



small but insignificant changes in the acceptor or donor 
splicing efficiencies (Table 2). 
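
To put the 19854 mutation in numbers, the thresholds and the 2^(R_i) proportionality stated in the text can be applied directly. This is a rough sketch only: the cutoffs (0 bits, 2.4 bits, about 9 bits for strong acceptors, about 7 bits for strong donors) come from the text, whereas the category labels and the function names `classify_site` and `relative_affinity` are illustrative assumptions.

```python
def classify_site(r_i, kind):
    """Label a splice site using the thresholds quoted in the text.

    kind is "acceptor" or "donor". The label strings are illustrative.
    """
    if r_i < 0.0:
        return "not a site"              # below the theoretical limit of 0 bits
    if r_i < 2.4:
        return "rarely functional"       # below the empirical 2.4-bit limit [20]
    strong_cutoff = 9.0 if kind == "acceptor" else 7.0
    return "strong" if r_i >= strong_cutoff else "weak to moderate"


def relative_affinity(r_i_a, r_i_b):
    """Ratio of binding affinities, taking affinity as proportional to 2**R_i."""
    return 2.0 ** (r_i_a - r_i_b)


if __name__ == "__main__":
    wild_type, mutant = 3.5, -3.8  # exon II acceptor at 19854, from the text
    print(classify_site(wild_type, "acceptor"))
    print(classify_site(mutant, "acceptor"))
    print(round(relative_affinity(wild_type, mutant), 1))
```

Under these assumptions the wild-type acceptor falls in the weak but usable range, the mutant falls below zero bits, and the predicted affinity drops by a factor of 2^7.3.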



Final comments 

This analysis by two independent approaches provides a sequence-based
interpretation of the observed alternative splicing seen in human POLH and
comparisons between the two paralogs POLH and POLI in human and mouse. The
analysis suggests that further experiments could profitably search for
regulatory elements and binding proteins within the intron I region of low
complexity. The analysis further indicates that murine PolH does not have
the same low-complexity region in intron I as seen in human POLH. If this
region plays a significant role in regulation of alternative splicing, then
exon II should be skipped only in human and not mouse. Experiments to test
this are under way. This intronic region, coupled with the low R_i value for
the acceptor site of exon II, explains the observed loss of exon II in a
significant number of transcripts. In the testis the alternative splicing
appears particularly high, involving loss of exon II in almost half the
transcripts [13]. This may represent a mechanism that partially
downregulates the low-fidelity polymerase to permit increased activity of
recombination pathways that we have found to be upregulated in the absence
of Pol H [24].



Methods 

Genome Cryptographer analysis. 

GC is a suite of Perl programs designed to facilitate 
megabase scale analysis of genomic sequence [14]. This 
suite is built of separate modules that exchange information 
via intermediate text files. Data in intermediate files are 
written in a consistent format: sequence name, sequence 
length, window size, appropriate data for a given window 
(the number of these "data" lines equals the number of 
windows that are contained per sequence), and, optionally, 
after a blank line, annotation data. 
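
The field order of the intermediate files can be illustrated with a small reader. This is a sketch under stated assumptions, not the GC code (which is written in Perl): it assumes one header field per line, one whitespace-separated data line per window, and that the number of windows is the sequence length divided by the window size, rounded up. The names `WindowFile` and `parse_window_file` are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class WindowFile:
    name: str
    length: int
    window: int
    data: list
    annotation: list = field(default_factory=list)


def parse_window_file(text):
    """Read: name, length, window size, per-window data, optional annotation."""
    lines = text.splitlines()
    name = lines[0].strip()
    length = int(lines[1])
    window = int(lines[2])
    n_windows = math.ceil(length / window)  # assumed: last window may be partial
    data = [line.split() for line in lines[3:3 + n_windows]]
    if len(data) != n_windows:
        raise ValueError("expected %d data lines, found %d" % (n_windows, len(data)))
    rest = lines[3 + n_windows:]
    annotation = []
    if rest and not rest[0].strip():        # annotation follows a blank line
        annotation = [ln for ln in rest[1:] if ln.strip()]
    return WindowFile(name, length, window, data, annotation)


if __name__ == "__main__":
    example = "POLH_region\n2500\n1000\n3 0.41\n1 0.38\n0 0.52\n\nexon1 100 200\n"
    parsed = parse_window_file(example)
    print(parsed.name, len(parsed.data), parsed.annotation)
```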



Analysis of the sequence is done in the following stages: 
Using script gc_plot.pl, we generate the plot of the GC 
content and number of CpG dinucleotides per AI. The CpG 
dinucleotide density is weighted by adding 0.25 to the 
dinucleotide count for each CpG dinucleotide that is found 
within 20 bp of another. This makes CpG islands more 
apparent as peaks in CpG dinucleotide density plots. The 
script also produces the graphic plot of the GC and CpG 
content and, if available, can annotate the plot with features 
from the output of the count_gene.pl script (making it easier 
to correlate changes in GC and CpG content with sequence 
features). 
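
The weighted CpG count described above can be sketched as follows. This is an illustrative Python reimplementation, not gc_plot.pl itself: it assumes that "within 20 bp" means the start positions of two CpG dinucleotides differ by at most 20, that each clustered CpG receives the 0.25 bonus once, and that a CpG is assigned to the window containing its C. The function names are hypothetical.

```python
def cpg_positions(seq):
    """Return 0-based positions of the C in every CpG dinucleotide."""
    s = seq.upper()
    return [i for i in range(len(s) - 1) if s[i:i + 2] == "CG"]


def window_stats(seq, window=1000, near=20, bonus=0.25):
    """Per analysis interval: (start, GC fraction, weighted CpG count)."""
    s = seq.upper()
    positions = cpg_positions(s)
    results = []
    for start in range(0, len(s), window):
        chunk = s[start:start + window]
        gc_fraction = (chunk.count("G") + chunk.count("C")) / len(chunk)
        score = 0.0
        for k, pos in enumerate(positions):
            if not (start <= pos < start + window):
                continue
            score += 1.0
            close_before = k > 0 and pos - positions[k - 1] <= near
            close_after = k + 1 < len(positions) and positions[k + 1] - pos <= near
            if close_before or close_after:
                score += bonus  # clustered CpGs count extra, sharpening island peaks
        results.append((start, gc_fraction, score))
    return results


if __name__ == "__main__":
    toy = "CGCG" + "A" * 96 + "CG" + "A" * 98
    for row in window_stats(toy, window=100):
        print(row)
```

In the toy sequence the two adjacent CpGs in the first window each score 1.25, while the isolated CpG in the second window scores 1.0.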

The sequence is analyzed for repeats using the publicly available
RepeatMasker program (Smit and Green,
http://repeatmasker.genome.washington.edu/cgi-bin/RM2_req.pl).
RepeatMasker output files are saved. Masked sequence is used for searches of
public and proprietary databases. Currently, GC employs the NCBI version of
BLAST (ftp://ncbi.nlm.nih.gov/blast/). Sequence is compared to nonredundant,
HTGS, dbSTS, and dbEST divisions of GenBank. Sequence similarity criteria
are set to reduce the probability of identifying ESTs from members of
closely related gene families (cutoff of expect score 10^-20).

Optionally, masked sequence is searched against a database containing
syntenic sequences of model organisms (in our case, PolH mouse sequence from
the syntenic region of mouse chromosome 17 [13]). Count_gene.pl and
count_homol.pl are used to analyze output of the BLAST searches, creating a
list of the number of relevant hits per AI. Count_gene.pl also generates a
first draft of sequence annotation data, by capturing all the database hits
that exceed a user-selectable threshold in length. If desired, this
annotation can be extended and updated by the user manually. We captured the
exact coordinates of regions of identity of database hits used for
annotation. This information proved to be invaluable for analysis of the
gene relationships, because the alignment of cDNA sequence to genomic
sequence automatically yields intron-exon organization of
the corresponding gene.
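
The per-interval counting and length-filtered draft annotation described above might look like the sketch below. It is not the count_gene.pl logic: it assumes hits are already parsed into 1-based inclusive (start, end) coordinates on the genomic sequence and that a hit is counted in every analysis interval it overlaps. Both function names are hypothetical.

```python
def hits_per_interval(hits, seq_len, width=1000):
    """Count database hits overlapping each analysis interval (AI)."""
    n_intervals = -(-seq_len // width)  # ceiling division
    counts = [0] * n_intervals
    for start, end in hits:
        first = (start - 1) // width
        last = min((end - 1) // width, n_intervals - 1)
        for k in range(first, last + 1):
            counts[k] += 1
    return counts


def draft_annotation(hits, min_len):
    """Keep only hits at least min_len bases long, as a first-draft annotation."""
    return [(s, e) for s, e in hits if e - s + 1 >= min_len]


if __name__ == "__main__":
    toy_hits = [(1, 500), (900, 1100), (2500, 2600)]
    print(hits_per_interval(toy_hits, seq_len=3000))
    print(draft_annotation(toy_hits, min_len=150))
```

A hit that straddles an interval boundary, such as (900, 1100) here, contributes to both neighboring intervals; counting it only once at its midpoint would be an equally defensible choice.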

Finally, graph.pl is used to gather information produced by gc_plot.pl (CpG
distribution data), RepeatMasker (repeat distribution data), count_gene.pl
(annotation and distribution of database hits), and count_homol.pl
(distribution of conserved regions) and produce a graphical summary.
Currently we are working on the extension of graph.pl capabilities (to make
output interactive and to add capability to include gene expression and copy
number data from array-based experiments). The first version of the Genome
Cryptographer software is accessible at http://kinase.ucsf.edu/gc.

Splice site analysis 

Computer analysis of the strength of splice sites was performed using
programs from the Individual Information package of T. Schneider [25].
Values of the individual information variable R_i(b, l, j) are calculated
for each base and position of the selected sequence, j, in the domain of
interest. In this situation, R_i(b, l, j) is the difference in uncertainty
before and after binding of a specific base, b, at a specific position, l,
relative to the origin of an acceptor or donor sequence, j.

These uncertainties are based on probability estimates. Probabilities are
estimated from the relative frequencies, f(b, l), of occurrence of bases, b,
at specified positions, l, relative to the splice site origin in a set of
known splice sites. Weighting matrices based on the relative frequencies of
bases within a specified domain of the location have been constructed from a
collection of more than 1700 aligned sequences from known acceptor and donor
sites [26]. Entries of the weighting matrices are

$$R_{iw}(b, l) = 2 - \left(-\log_2 f(b, l)\right) + e(n, l), \tag{1}$$

where e(n, l) is a small-sample error correction for n samples at
position l.

The R_i value of a site at a selected location, j, is the sum of the
individual R_i values of a sequence of bases over a restricted domain about
that location. Symbolically,

$$R_i(j) = \sum_{l \,\in\, \text{site domain}} \; \sum_{b \in \{a, c, g, t\}} s(b, l, j)\, R_{iw}(b, l), \tag{2}$$

where the function s(b, l, j) = 1 specifies base b at position l for
sequence j and is 0 otherwise. For acceptors in human DNA the site domain is
-25 to +2; for donors, it is -3 to +6. Given any specific genomic sequence,
the site sequence is determined entirely by the position of an origin, or
other offset, of the site in the genomic sequence.
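
Eqs. (1) and (2) can be illustrated with a toy calculation. The sketch below is not the DELILA implementation: it drops the small-sample correction e(n, l), substitutes a simple pseudocount so that unobserved bases do not give log2(0), and uses four invented donor-like sequences in place of the published collection of more than 1700 sites. The function names, the pseudocount, and the toy data are all assumptions.

```python
import math

BASES = "acgt"


def weight_matrix(aligned_sites, first_pos, pseudocount=1.0):
    """Build R_iw(b, l) = 2 + log2 f(b, l) from aligned site sequences.

    The correction term e(n, l) of Eq. (1) is omitted; the pseudocount is
    a stand-in chosen for this sketch only.
    """
    n = len(aligned_sites)
    matrix = {}
    for offset in range(len(aligned_sites[0])):
        column = [site[offset] for site in aligned_sites]
        for base in BASES:
            freq = (column.count(base) + pseudocount) / (n + 4 * pseudocount)
            matrix[(base, first_pos + offset)] = 2.0 + math.log2(freq)
    return matrix


def individual_information(genome, origin, matrix, first_pos, last_pos):
    """Eq. (2): sum R_iw over the site domain around a 0-based origin."""
    return sum(matrix[(genome[origin + l], l)]
               for l in range(first_pos, last_pos + 1))


if __name__ == "__main__":
    # Toy donor-like sites covering positions -3..+6 (invented data).
    sites = ["caggtaagta", "aaggtgagtc", "caggtaagtg", "gaggtaagat"]
    matrix = weight_matrix(sites, first_pos=-3)
    genome = "ttt" + "caggtaagta" + "ccc"
    for origin in (5, 6):
        print(origin, round(individual_information(genome, origin, matrix, -3, 6), 2))
```

With the site placed so that its origin is at index 6, the score is clearly positive, and shifting the origin by one base makes it negative, which is the behavior the scan for candidate sites relies on.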

The values of R_i are normally expressed in bits (binary digits). One bit
is the amount of information needed to distinguish between two choices, 2
bits are needed to choose one of four choices, etc. Reasons for choosing
this functional form and these units are discussed in Schneider [25] and
Shannon [27,28].

Sequence walkers illustrate graphically each R_i contribution of the bases
to the acceptor or donor site at a location [29]. On these plots, bases that
contribute positively to the R_i sum point upward; those that contribute
negatively point down. The height of the letter is proportional to the
information contribution of that base. The R_i sum is given along with the
type of site for which it is calculated.



For more information on individual information, sequence logos, information
theory, and related topics, see the Schneider Lab Web page,
http://www.lecb.ncifcrf.gov/~toms/.

Programs used were mkdb, dbbk, catal, delila, scan, 
exon, and lister. These are a few of the members of the large 
DELILA package [29]. The programs were initially run on 
a Sun SPARCstation 2 under Sun OS 5.5. Recent runs were 
done on a Sun Blade 1000 using Solaris 8. 

Mkdb, dbbk, and catal convert genomic DNA data from 
GenBank format to delila format. Delila processes the data, 
selecting pieces and providing mutations. Scan locates 
splice sites with specified properties. Lister prepares the 
walker plots with splice sites and exons displayed. The 
panels of Fig. 4 were generated this way. 

Fig. 3 gives a different perspective. For clarity of display, 
the bit numbers were displayed as vertical bars at the ap- 
propriate sites through the genomic regions, and only those 
values occurring at the sites of known splice sites were used. 

Web site references 

Genome Cryptographer: http://kinase.ucsf.edu/gc. Collins and Volk.

NIH BLAST Web site: ftp://ncbi.nlm.nih.gov/blast/.

RepeatMasker Web page:
http://repeatmasker.genome.washington.edu/cgi-bin/RM2_req.pl. Smit and Green.

Schneider Laboratory Web page: http://www.lecb.ncifcrf.gov/~toms/.
Molecular information theory and the theory of molecular machines.